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ABSTRACT 

Incoherent dedispersion is a computationally intensive problem that appears fre- 
quently in pulsar and transient astronomy. For current and future transient pipelines, 
dedispersion can dominate the total execution time, meaning its computational speed 
acts as a constraint on the quality and quantity of science results. It is thus critical 
that the algorithm be able to take advantage of trends in commodity computing hard- 
ware. With this goal in mind, we present analysis of the 'direct', 'tree' and 'sub-band' 
dedispersion algorithms with respect to their potential for efficient execution on mod- 
ern graphics processing units (GPUs). We find all three to be excellent candidates, 
and proceed to describe implementations in C for CUDA using insight gained from the 
analysis. Using recent CPU and GPU hardware, the transition to the GPU provides 
a speed-up of 9 x for the direct algorithm when compared to an optimised quad-core 
CPU code. For realistic recent survey parameters, these speeds are high enough that 
further optimisation is unnecessary to achieve real-time processing. Where further 
speed-ups are desirable, we find that the tree and sub-band algorithms are able to 
provide 3-7 x better performance at the cost of certain smearing, memory consump- 
tion and development time trade-offs. We finish with a discussion of the implications 
of these results for future transient surveys. Our GPU dedispersion code is publicly 
available as a C library at: http : // dedisp . googl ecode . com/| 
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1 INTRODUCTION 

With the advent of modern telescopes and digital signal pro- 
cessing back-ends, the time-resolved radio sky has become 
a rich source of astrophysical information. Observations of 
pulsars allow us to probe the nature of ne utron stars (|Lat- 



timer & Prakash 



van den Heuvel 



2004h, stellar populations (Bhattacharya & 



1991|, the galactic environment (Gaensler 
et al.|2008[ ), plasma physics and gravitational waves ( |Lyne 
et al.||2004|). Of equal significance are transient signals such 



as those from rotating radio transients (McLaughlin et al 
|2006[ ) and potentially rare one-off events such as 'Lorimer 
bursts' ( Lorimer et al.|2007 Keane et al.|20lT |, which may 
correspond to previously unknown phenomena. These ob- 
servations all depend on the use of significant computing 
power to search for signals within long, frequency-resolved 
time series. 

As radiation from sources such as pulsars propagates 
to Earth, it interacts with free electrons in the interstellar 
medium. This interaction has the effect of delaying the signal 



in a frequency-dependent manner - signals at lower frequen- 
cies are delayed more than those at higher frequencies. For- 
mally, the observed time delay, At, between two frequencies 
v\ and vi as a result of dispersion by the interstellar medium 
is given by 



At = fcr 



j ■ DM ■ (i/j — 1^2 ), 
4.148808 x 10 3 MHz 2 pe- 



ll) 



where &dm = ;Hbr 
is the dispersion constanlQ and the frequencies are in MHz. 
The parameter DM specifies the dispersion measure along 
the line of sight in pc cm -3 , and is defined as 



DM 



n e dZ, 



(2) 



where n e is the electron number density (cm -3 ) and d is the 
distance to the source (pc). 

Once a time-varying source has been detected, its dis- 
persion measure can be obtained from observations of its 
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1 We note that the dispersion constant is commonly approxi- 
mated in the literature as 1/(2.41 X 10~ 4 ) MHz 2 pc -1 cm 3 s. 
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Figure 1. An illustration of a dispersion trail (top) and its corre- 
sponding dedispersion transform (bottom). The darkest horizon- 
tal slice in the dedispersion transform gives the correctly dcdis- 
persed time series. 



phase as a function of frequency; this in turn allows the 
distance to the object to be calculated via equation |2j, as- 
suming one has a model for the Galactic electron density 
n e . When searching for new sources, however, one does not 
know the distance to the object. In these cases, the disper- 
sion measure must be guessed prior to looking for a signal. 
To avoid excessive smearing of signals in the time series, and 
a consequent loss of signal-to-noise, survey pipelines typi- 
cally repeat the process for many trial dispersion measures. 
This process is referred to as a dedispersion transform. An 
example of the dedispersion transform is shown in Fig. [I] 

Computing the dedispersion transform is a computa- 
tionally expensive task: a simple approach involves a sum- 
mation across a band of, e.g., ~ 10 frequency channels for 
each of ~ 10 3 (typically) dispersion measures, for each time 
sample. Given modern sampling intervals of 0(64/is), com- 
puting this in real-time is a challenging task, especially if 
the process must be repeated for multiple beams. The pro- 
hibitive cost of real-time dedispersion has traditionally ne- 
cessitated that pulsar and transient survey projects use of- 
fline processing. 

In this paper we consider three ways in which computa- 
tion of the dedispersion transform may be accelerated, en- 
abling real-time processing at low cost. First, in Section [2] 
we demonstrate how modern many-core computing hard- 
ware in the form of graphics processing units [GPUs; see 



Fluke et al. (2011) for an introduction] can provide an or- 
der of magnitude more performance over a multi-core cen- 
tral processing unit (CPU) when dedispersing 'directly'. The 
use of GPUs for incoherent dedispersion is not an entirely 



new idea. Dodson et al. (20101 introduced an implementa- 



tion of such a system as part of the CRAFT survey. |Magro| 
et al. (20111 described a similar approach and how it may 



be used to construct a GPU-based real-time transient detec- 
tion pipeline for modest fractional bandwidths, demonstrat- 
ing that their GPU dedisperser could out perform a generic 



code by two orders of magnitude. In this work we provide 
a thorough analysis of both the direct incoherent dedisper- 
sion algorithm itself and the details of its implementation 
on GPU hardware. 

In Section [3] we then consider the use of the 'tree' al- 
gorithm, a (theoretically) more efficient means of comput- 
ing the dedispersion transform. To our knowledge, this tech- 
nique has not previously been implemented on a GPU. We 
conclude our analysis of dedispersion algorithms in Section 
[4] with a discussion of the 'sub-band' method, a derivative 
of the direct method. 

In section [5] we report accuracy and timing benchmarks 
for the three algorithms and compare them to our theoretical 
results. Finally, we present a discussion of our results, their 
implications for future pulsar and transient surveys and a 
comparison with previous work in Section [6] 



2 DIRECT DEDISPERSION 
2.1 Introduction 

The direct dedispersion algorithm operates by directly sum- 
ming frequency channels along a quadratic dispersion trail 
for each time sample and dispersion measure. In detail, the 
algorithm computes an array of dedispersed time series D 
from an input dataset A according to the following equation: 



D d 



v,t+&t(d,v) i 



(3) 



where the subscripts d, t and u represent dispersion measure, 
time sample and frequency channel respectively, and N v is 
the total number of frequency channels. Note that through- 
out this paper we use the convention that £^ means the 
sum over the range i = to i = N — 1. The function At(d, v) 
is a discretized version of equation |TJ and gives the time de- 
lay relative to the start of the band in whole time samples 
for a given dispersion measure and frequency channel: 



~At~ 



AT{u) - Ar y(u + uAuy 

At(d, v) = round (DU(d)AT(u)) . 



(4) 
(5) 



where Ar is the time difference in seconds between two ad- 
jacent samples (the sampling interval), is the frequency 
in MHz at the start of the band, Av is the frequency differ- 
ence in MHz between two adjacent channels and the function 
round(:r) means x rounded to the nearest integer. The func- 
tion or array DM(d) is used to specify the dispersion mea- 
sures to be computed. Note that the commonly-used central 
frequency, v c , and bandwidth, BW, parameters are related 
by BW = NuAv and v c = u + |BW. 

After dedispersion, the dedispersed time series Dd,t can 
be searched for periodic or transient signals. 

When dedispersing at large DM, the dispersion of a sig- 
nal can be such that it is smeared significantly within a 
single frequency channel. Specifically, this occurs when the 
gradient of a dispersion curve on the time-frequency grid is 
less than unity (i.e., beyond the 'diagonal'). Once this effect 
becomes significant, it becomes somewhat inefficient to con- 
tinue to dedisperse at the full native time resolution. One op- 
tion is to reduce the time resolution by a factor of two when 
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the DM exceeds the diagonal by adding adjacent pairs of 
time samples. This process is then repeated at 2x the diag- 
onal, 4x etc. We refer to this technique as 'time-scrunching'. 
The use of time-scrunching will reduce the overall computa- 
tional cost, but can also slightly reduce the signal-to-noise 
ratio if the intrinsic pulse width is comparable to that of the 
dispersion smear. 



2.2 Algorithm analysis 

The direct dedispersion algorithm's summation over N v fre- 
quency channels for each of N t time samples and Num dis- 
persion measures gives it a computational complexity of 



T diroct = 0(N t N„NoM). 



(6) 



The algorithm was analysed previously for many-core archi- 
tectures in Barsdell, Barnes & Fluke (2010). The key find- 
ings were: 

(i) the algorithm is best parallelised over the "embarass- 
ingly parallel" dispersion-measure (d) and time (t) dimen- 
sions, with the sum over frequency channels (v) being per- 
formed sequentially, 

(ii) the algorithm has a very high theoretical arithmetic 
intensity, of the same magnitude as the number of dispersion 
measures computed [typically 0(100 — 1000)], and 

(iii) the memory access patterns generally exhibit reason- 
able locality, but their non-trivial nature may make it diffi- 
cult to achieve a high arithmetic intensity. 

While overall the algorithm appears well-positioned to take 
advantage of massively parallel hardware, we need to per- 
form a deeper analysis to determine the optimal implemen- 
tation strategy. The pattern in which memory is accessed is 
often critical to performance on massively-parallel architec- 
tures, so this is where we now turn our attention. 

While the d dimension involves a potentially non-linear 
mapping of input indices to output indices, the t dimension 
maintains a contiguous mapping from input to output. This 
makes the t dimension suitable for efficient memory access 
operations via spatial caching, where groups of adjacent par- 
allel threads access memory all at once. This behaviour typ- 
ically allows a majority of the available memory bandwidth 
to be exploited. 

The remaining memory access issue is the potential use 
of temporal caching to increase the arithmetic intensity of 
the algorithm. Dedispersion at similar DMs involves access- 
ing similar regions of input data. By pre-loading a block of 
data into a shared cache, many DMs could be computed be- 
fore needing to return to main memory for more data. This 
would increase the arithmetic intensity by a factor propor- 
tional to the size of the shared cache, potentially providing 
a significant performance increase, assuming the algorithm 
was otherwise limited by available memory bandwidth. The 
problem with the direct dedispersion algorithm, however, is 
its non-linear memory access pattern in the d dimension. 
This behaviour makes a caching scheme difficult to devise, 
as one must account for threads at different DMs needing 
to access data at delayed times. Whether temporal caching 
can truly be used effectively for the direct dedispersion al- 
gorithm will depend on details of the implementation. 



2.3 Implementation Notes 

When discussing GPU implementations throughout this pa- 
per, we use the terms 'Fermi' and 'pre-Fermi' GPUs to mean 
GPUs of the NVIDIA Fermi architecture and those of older 
architectures respectively. We consider both architectures in 
order to study the recent evolution of GPU hardware and 
gain insight into the future direction of the technology. 

We implemented the direct dedispersion algorithm us- 
ing the C for CUDA platform^] As suggested by the anal- 
ysis in Section |2.2| the algorithm was parallelised over the 
dispersion-measure and time dimensions, with each thread 
summing all N v channels sequentially. During the analysis 
it was also noted that the algorithm's memory access pat- 
tern exhibits good spatial locality in the time dimension, 
with contiguous output indices mapping to contiguous in- 
put indices. We therefore chose time as the fastest-changing 
(i.e., x) thread dimension, such that reads from global mem- 
ory would always be from contiguous regions with a unit 
stride, maximising throughput. The DM dimension was con- 
sequently mapped to the second (i.e., y) thread dimension. 

While the memory access pattern is always contigu- 
ous, it is not always aligned. This is a result of the delays, 
At(d, v), introduced in the time dimension. At all non-zero 
DMs, the majority of memory accesses will begin at arbi- 
trary offsets with respect to the internal alignment bound- 
aries of the memory hardware. The consequence of this is 
that GPUs that do not have built-in caching support may 
need to split the memory requests into many smaller ones, 
significantly impacting throughput to the processors. In or- 
der to avoid this situation, we made use of the CPU's texture 
memory, which does support automatic caching. On pre- 
Fermi GPU hardware, the use of texture memory resulted 
in a speed-up of around 5x compared to using plain device 
memory, highlighting the importance of understanding the 
details of an algorithm's memory access patterns when using 
these architectures. With the advent of Fermi-class GPUs, 
however, the situation has improved significantly. These de- 
vices contain an LI cache that provides many of the advan- 
tages of using texture memory without having to explicitly 
refer to a special memory area. Using texture memory on 
Fermi-class GPUs was slightly slower than using plain de- 
vice memory (with LI cache enabled), as suggested in the 
CUDA programming guidfQ 

Input data with fewer bits per sample than the ma- 
chine word size (currently assumed to be 32 bits) were 
handled using bit-shifting and masking operations on the 
GPU. It was found that a convenient format for work- 
ing with the input data was to transpose the input 
from time-major order to frequency-major order by whole 
words, leaving consecutive frequency channels within each 
word. For example, for the case of two samples per word, 
the data order would be: (j4.__,ti>i«_,ti),(-4i/i,t_ A„ 2 , t2 ), — , 
(A V3 ,t 1 A Vi ,t 1 ),(A h , 3t t 2 A,j i ,t2),—, where brackets denote data 
within a machine word. This format means that time delays 
are always applied in units of whole words, avoiding the need 
to deal with intra- word delays. 



2 |http : //developer ■ nvidia . com/ob j ect /gpucomput i ng . html| 

3 The current version ot the (JUDA programming guide is 
available for download at: http://www.nvidia.com/object/cuda_ 
develop.html 
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The thread decomposition was written to allow the 
shape of the block (i.e., number of DMs or time samples per 
block) to be tuned. We found that for a block size of 256 
threads, optimal performance on a Fermi GPU was achieved 
when this was divided into 8 time samples x 32 DMs. We 
interpreted this result as a cache-related effect, where the 
block shape determines the spread of memory locations ac- 
cessed by a group of neighbouring threads spread across 
time-DM space, and the optimum occurs when this spread 
is minimised. On pre-Fermi GPUs, changing the block shape 
was found to have very little impact on performance. 

To minimise redundant computations, the functions 
DM(d) and AT(v) were pre-computed and stored in look- 
up tables for the given dispersion measures and frequency 
channels respectively. Delays were then computed simply by 
retrieving values from the two tables and evaluating equa- 
tion IjSjl, requiring only a single multiplication and a round- 
ing operation. On pre-Fermi GPUs, the table corresponding 
to AT(v) was explicitly stored in the GPU's constant mem- 
ory space, which provides extremely efficient access when 
all threads read the same value (this is always the case for 
our implementation, where frequency channels are traversed 
sequentially). On Fermi-generation cards, this explicit use of 
the constant memory space is unnecessary - constant mem- 
ory caching is used automatically when the compiler deter- 
mines it to be possible. 

To amortize overheads within the GPU kernel such as 
index calculations, loop counters and time-delay computa- 
tions, we allowed each thread to store and sum multiple time 
samples. Processing four samples per thread was found to 
significantly reduce the total arithmetic cost without affect- 
ing memory throughput. Increasing this number required 
more registers per thread (a finite resource), and led to di- 
minishing returns; we found four to be the optimal solution 
for our implementation. 

Our implementation was written to support a chan- 
nel "kill mask", which specifies which frequency chan- 
nels should be included in the computation and which 
should be skipped (e.g., to avoid radio frequency in- 
terference present within them). While our initial ap- 
proach was to apply this mask as a conditional state- 
ment [e.g., if ( kill_mask [channel] ) { sum += data }], 
it was found that applying the mask arithmetically (e.g., 
sum += data * kill_mask [channel] ) resulted in better 
performance. This is not particularly surprising given the 
GPU hardware's focus on arithmetic throughput rather than 
branching operations. 

Finally, we investigated the possibility of using tempo- 
ral caching, as discussed in the analysis in Section [2. 2| Un- 
like most CPUs, GPUs provide a manually-managed cache 
(known as shared memory on NVIDIA GPUs). This provides 
additional power and flexibility at the cost of programming 
effort. We used shared memory to stage a rectangular sec- 
tion of input data (i.e., of time-frequency space) in each 
thread block. Careful attention was given to the amount of 
data cached, with additional time samples being loaded to 
allow for differences in delay across a block. The cost of de- 
velopment was significant, and it remained unclear whether 
the caching mechanism could be made robust against a va- 
riety of input parameters. Further, we found that the over- 
all performance of the code was not significantly altered by 
the addition of the temporal caching mechanism. We con- 



cluded that the additional overheads involved in handling 
the non-linear memory access patterns (i.e., the mapping 
of blocks of threads in time-DM space to memory in time- 
frequency space) negated the performance benefit of staging 
data in the shared cache. We note, however, that cacheing 
may prove beneficial when considering only low DMs (e.g., 
below the diagonal), where delays vary slowly and memory 
access patterns remain relatively compact. 

In theory it is possible that, via careful re-packing of 
the input data, one could exploit the bit-level parallelism 
available in modern computing hardware in addition to the 
thread-level parallelism. For example, for 2-bit data, pack- 
ing each 2-bit value into 8-bits would allow four values to 
be summed in parallel with a single 32-bit addition instruc- 
tion. In this case, §233; = 85 additions could be performed 
before one risked integer overflow. To dedisperse say 1024 
channels, one could first sum blocks of 85 channels and then 
finish the summation by re-packing the partial sums into 
a larger data type. This would achieve efficient use of the 
available processing hardware, at the cost of additional im- 
plementation complexity and overheads for re-packing and 
data management. We did not use this technique in our GPU 
dedispersion codes, although our reference CPU code does 
exploit this extra parallelism by packing four 2-bit samples 
into a 64-bit word before dedispersion. 



3 TREE DEDISPERSION 
3.1 Introduction 



The tree dedispersion algorithm, first described by |Taylor| 
(19741, attempts to reduce the complexity of the dedisper- 
sion computation from O(NtN^NuM) to 0(NtN l/ \og N v ). 
This significant speed-up is obtained by first regularising the 
problem and then exploiting the regularity to allow repeated 
calculations to be shared between different DMs. While the- 
oretical speed-ups of 0(100) are possible, in practice a num- 
ber of additional overheads arise when working with real 
data. These overheads, as well as its increased complexity, 
have meant that the tree algorithm is rarely used in modern 
search pipelines. In this work we investigate the tree algo- 
rithm in order to assess its usefulness in the age of many-core 
processors. 

In its most basic form, the tree dedispersion algorithm 
is used to compute the following: 



+ At'(d',f) i 



At'(d' , v) = round I d' 



N v - 1 



(7) 



(8) 



for d' in the range ^ d' < N v . The regularisation is such 
that the delay function At' (d, v) is now a linear function of 
v that ranges from to exactly d! across the band. The DMs 
computed by the tree algorithm are therefore: 



DM(d') = 



d' 



AT{N„ - 1) : 



(9) 



where the function AT(v) is that given by equation Q. 

The tree algorithm is able to evaluate equation |7| for 
d! in the range ^ d! < N v in just log 2 N v steps. It achieves 
this feat by using a divide and conquer approach in the same 
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Figure 2. Visualisation of the tree dedispersion algorithm. Rect- 
angles represent frequency channels, each containing a time series 
going 'into the page'. Arrows indicate the flow of data, triangles 
represent addition operations and circles indicate unit time delays 
into the page. 



way as the well-known fast Fourier transform (FFT) algo- 
rithm. The tree algorithm is visualised in Fig. [2] We define 
the computation at each step i as follows: 



4° 


= A V:t 






(10) 


Ai +1 

A 2v,t 




,2v),t + ^<S>(i,2u- 


rl),t + e(i,2ix) 


(11) 


A i+1 


= 


,2v),t + ^<S>(i,2u- 


rl),t + e(i,2iy + l) 


(12) 


D'd>,t 


= D' v ,t 


, log 2 N v 




(13) 



The integer function 0(i, u) gives the time delay for a given 
iteration and frequency channel and can be defined as 

Q(i,v) = [(*/mod2 l+1 ) + l]/2, (14) 

where mod is the modulus operator, and division is taken to 
be truncated integer division. The integer function $(z, v), 
which we refer to as the 'shuffle' function, re-orders the in- 
dices v according to a pattern defined as follows: 

<3>(r, v) = (y mod 2)xr+- + — xr, (15) 
2 2r 

where the parameter r = 2* is known as the radix. 

While the tree dedispersion algorithm succeeds in dra- 
matically reducing the computational cost of dedispersion, 
it has a number of constraints not present in the direct al- 
gorithm: 

(i) the computed dispersion trails are linear in frequency, 
not quadratic as found in nature [see equation Q], 

(ii) the computed dispersion measures are constrained to 
those given by equation j9j, and 

(iii) the number of input frequency channels N v (and thus 
also the number of DMs) must be a power of two. 



is common for the number of frequency channels to be a 
power of two, and blank channels can be added when this is 
not the case. Constraints (i) and (ii) are more problematic, 
as they prevent the computation of accurate and efficiently- 
distributed dispersion trails. Fortunately there are ways of 
working around these limitations. 

One method is to approximate the dispersion trail with 
piecewise linear segments by dividing the input data into 
sub-bands ( Manchester et al.|1996 l. Another approach is to 
quadratically space the input frequencies by padding with 
blank channels as a pre-processing step such that the second 



order term in the dispersion trail is removed (Manchester 
|et al.|2001[ ). These techniques are described in the next two 
sections. 



3.1.1 The piecewise linear tree method 

Approximation of the quadratic dispersion curve using 
piecewise linear segments involves two stages of computa- 
tion. If the input data are divided into N s sub-bands of 
length 



(16) 



with the n th sub-band starting at frequency channel 

v n =nK, (17) 

then from equation |7| we see that the tree dedispersion 
algorithm applied to each sub-band results in the following: 



+i/',t+Ai'(d', v n +v') i 



(18) 



which we refer to as stage 1 of the piecewise linear tree 
method. 

In each sub-band, we approximate the quadratic disper- 
sion trail with a linear one. We compute the linear DM in 
the n th sub-band that approximates the true DM indexed 
by d as follows: 



d'„(d) = At(d,v n+1 ) -At(d,v n ) 
= round (DM(d) [AT(i/„+i) 



AT(i/„) 



(19) 
(20) 



Applying the constraint d' n < N' v and noting that the great- 
est dispersion delay occurs at the end of the band, we obtain 
a limit on the DM that the basic piecewise linear tree algo- 
rithm can compute. This limit is commonly referred to as 
the 'diagonal' DM, as it corresponds to a dispersion trail in 
the time-frequency grid with a gradient of unity]^] 



DM 



(piecewise) 
diag 



K 



AT(N„) - AT(iV„ - N' v ) ' 



(21) 



A technique for computing larger DMs with the tree algo- 
rithm is discussed in Section [3.1.3l 



Note that the 'i' in equation (12 1 1 arises from the round-to- 



Constraint (iii) is generally not a significant concern, as it nearest operation in equation (T20p 
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The dedispersed sub-bands can now be combined to ap- 
proximate the result of equation pi: 



n,d' n (d),t + At'^(d) ! 



(22) 



At"(d) = round ( DM(d) ^ AT(i/ ra+1 ) - AT(i 



(23) 

This forms stage 2 of the piecewise linear tree computation. 

The use of the tree algorithm with sub-banding intro- 
duces an additional source of smearing into the dedispersed 
time series as a result of approximating the quadratic disper- 
sion curve with a piecewise linear one. We derive an analytic 
upper limit for this smearing in Appendix [A] 

3.1.2 The frequency-padded tree method 

An alternative approach to adapting the tree algorithm to 
quadratic dispersion curves is to linearise the input data via 
a change of frequency coordinates. Formally, the aim is to 
'stretch' AT(v) [equation to a linear function AT'(f') oc 
v' . Expanding to first order around v — 0, we have: 



ATV) = AT(0) + AT(y') [AT(v 



(24) 



The change of variables v — > v' is then found by equating 
AT(y) with its linear approximation, AT'(f'), and solving 
for u'(v), which gives 



round 



1 vg 

2 Av 



1 



1+ v 



(25) 



Evaluating at v — N v gives the total number of frequency 
channels in the linearised coordinates, which determines the 
additional computational overhead introduced by the pro- 
cedure. Note, however, that this number must be rounded 
up to a power of two before the tree dedispersion algorithm 
can be applied. For observations with typical effective band- 
widths and channel counts that are already a power of two, 
the frequency padding technique is unlikely to require in- 
creasing the total number of channels by more than a factor 
of two. 

In practice, the linearisation procedure is applied by 
padding the frequency dimension with blank channels such 
that the real channels are spaced according to equation ( |25[ ). 
Once the dispersion trails have been linearised, the tree al- 
gorithm can be applied directly. 

The 'diagonal' DM when using the frequency padding 
method corresponds to 

(padded) _ 1 



DM 



diag 



AT(1)' 



. 1.3 Computing larger DMs 



The basic tree dedispersion algorithm computes exactly the 
DMs specified by equation Q. In practice, however, it is 
often necessary to search a much larger range of dispersion 
measures. Fortunately, there are techniques by which this 
can be achieved without having to resort to using the direct 
method. The tree algorithm can be made to compute higher 



DMs by first transforming the input data and then repeat- 
ing the dedispersion computation. Formally, the following 
sequence of operations can be used to compute an arbitrary 
range of DMs: 

(i) Apply the tree algorithm to obtain DMs from zero to 

DMdiag. 

(ii) Impose a time delay across the band. 

(iii) Apply the tree algorithm to obtain DMs from DMdi ag 

tO 2DM d iag. 

(iv) Increment the imposed time delay. 

(v) Repeat from step (ii) to obtain DMs up to 2 JV DMdi ag - 

The imposed time delay is initially a simple diagonal across 
the band (i.e., At — v), and is implemented by incrementing 
a memory stride value rather than actually shifting memory. 
While this method enables dedispersion up to arbitrarily 
large DMs, it does not alter the spacing of DM trials, which 
remains fixed as per equation |9j. 

The 'time-scrunching' technique, discussed in Section 
|2. 1| for the direct algorithm, can also be applied to the tree 
algorithm. The procedure is as follows: 

(i) Apply the tree algorithm to obtain DMs from zero to 

DMdiag. 

(ii) Impose a time delay across the band. 

(iii) Apply the tree algorithm to obtain DMs from DMdi ag 

tO 2DM d iag. 

(iv) Compress ('scrunch') time by a factor of 2 by sum- 
ming adjacent samples. 

(v) Impose a time delay across the band. 

(vi) Apply the tree algorithm to obtain DMs from 
2DM diag to 4DM diag . 

(vii) Repeat from step (iv) to obtain DMs up to 

2 iV DMdiag. 

As with the direct algorithm, the use of time-scrunching pro- 
vides a performance benefit at the cost of a minor reduction 
in the signal-to-noise ratio for pulses of intrinsic width near 
the dispersion measure smearing time. 



3.2 Algorithm analysis 

The tree dedispersion algorithm's computational complex- 
ity of 0(N t N u log N v ) breaks down into log 2 N u sequential 
steps, with each step involving the computation of 0(N t N l/ ) 
independent new values, as seen in equations (|10[) to (] 13[> . 

res & 



Following the analysis methodology of Barsdell, Barnes & 
Fluke (20101, the algorithm therefore has a depth complex- 
ity of O(logAv), meaning it contains this many sequentially- 
dependent operations. Interestingly, this result matches that 
of the direct algorithm, although the tree algorithm requires 
significantly less total work. From a theoretical perspective, 
this implies that the tree algorithm contains less inherent 
parallelism than the direct algorithm. In practice, however, 
the number of processors will be small relative to the size of 
the problem (NtN u ), and this reduced inherent parallelism 
is unlikely to be a concern for performance except when pro- 
cessing very small data-sets. 

Branching (i.e., conditional statements) within an al- 
gorithm can have a significant effect on performance when 



targetting GPU-like hardware (Barsdell, Barnes & Fluke 



|2010[ > . Fortunately, the tree algorithm is inherently branch- 
free, with all operations involving only memory accesses and 
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arithmetic operations. This issue is therefore of no concern 
in this instance. 

The arithmetic intensity of the tree algorithm is de- 
termined from the ratio of arithmetic operations to mem- 
ory operations. To process NtN^ samples, the algorithm in- 
volves N t N u log 2 N v 'delay and add' operations, and pro- 
duces NtN u samples of output. In contrast to the direct 
algorithm, where the theoretical arithmetic intensity was 
proportional to the number of DMs computed, the tree algo- 
rithm requires only 0(\ogN v ) operations per sample. This 
suggests that the tree algorithm may be unable to exploit 
GPU-like hardware as efficiently as the direct algorithm. 
However, the exact arithmetic intensity will depend on con- 
stant factors and additional arithmetic overheads, and will 
only become apparent once the algorithm has been imple- 
mented. We defer discussion of these results to Section \3. 31 



Achieving peak arithmetic intensity requires reading in- 
put data from 'slow memory' into 'fast memory' (e.g., from 
disk into main memory, from main memory into cache, from 
host memory into GPU memory etc.) only once, before per- 
forming all computations within fast memory and writing 
the results, again just once, back to slow memory. In the tree 
dedispersion algorithm, this means performing all log 2 N v 
steps entirely within fast memory. The feasibility of this will 
depend on implementation details, the discussion of which 
we defer to Section [3. 3| However, it will be useful to assume 
that some sub-set of the total computation will fit within 
this model. We will therefore continue the analysis of the 
tree algorithm under the assumption that we are computing 
only a (power-of-two) subset, or block, of B v channels. 

The memory access patterns within the tree algorithm 



resemble those of the direct algorithm (see Section 2.2| 



Time samples are always accessed contiguously, with an off- 
set that is essentially arbitrary. In the frequency dimension, 
memory is accessed according to the shuffle function [equa- 
tion ( 15 1] depicted in Fig. |5J where at any given step of the 
algorithm the frequency channels 'interact' in pairs, the in- 
teraction involving their addition with different time delays. 

With respect to the goal of achieving peak arithmetic in- 
tensity, the key issue for the memory access patterns within 
the tree algorithm is the extent to which they remain 'com- 
pact'. This is important because it determines the ability 
to operate on isolated blocks of data independently, which 
is critical to prolonging the time between successive trips 
to slow memory. In the frequency dimension, the computa- 
tion of some local (power-of-two) sub-set of channels B v in- 
volves accessing only other channels within the same subset. 
In this sense we can say that the memory access patterns 
are 'locally compact' in channels. In the time dimension, 
however, we note that the algorithm applies compounding 
delays (equivalent to offsets in whole time samples). This 
means that the memory access patterns 'leak' forward, with 
any local group of time samples always requiring access to 
the next group. The amount by which the necessary delays 
'leak' in time for each channel is given by the integrated 
delay in that channel after B v steps (see Fig. |5| . The total 
integrated delay across B v channels is B V (B V — l)/2, which 
is the number of additional values that must be read into 
fast memory by the block in order to compute all log 2 B v 
steps without needing to return to global memory and apply 
a global synchronisation. 



3.3 Implementation Notes 

As with the direct algorithm, we implemented the tree al- 
gorithm on a GPU in C for CUDA. For our first attempt, 
we took a simple approach where each of the log 2 N u steps 
in the computation was performed by a separate call to a 
GPU function (or kernel). This approach is not ideal, as 
it is preferable to perform more computation on the device 
before returning to the host (as per the discussion of arith- 
metic intensity in Section 3.2 1, but was necessary in order 



to guarantee global synchronisation across threads on the 
GPU between steps. This is a result of the lack of global 
synchronisation mechanisms on current GPUs. 

Between steps, the integer delay and shuffle functions 



[equations ( 14 1 and ( 15 1] were evaluated on the host and 
stored in look-up tables. These were then copied to con- 
stant memory on the device prior to executing the kernel 
function to compute the step. The use of constant memory 
ensured retrieval of these values would not be a bottle-neck 
to performance during the computation of each step of the 
tree algorithm. 

The problem was divided between threads on the GPU 
by allocating one thread for every time sample and every 
pair of frequency channels. This meant that each thread 
would compute the delayed sums between two 'interacting' 
channels according to the pattern depicted in Fig. [2] for the 
current step. 

The tree algorithm's iterative updating behaviour re- 
quires that computations at each step be performed 'out- 
of-place'; i.e., output must be written to a memory space 
separate from that of the input to avoid modifying input 
values before they have been used. We achieved this effect 
by using a double-buffering scheme, where input and output 
arrays are swapped after each step. 

While the algorithms differ significantly in their details, 
one point of consistency between the direct and tree methods 
is the need to apply time delays to the input data. There- 
fore, just as with our implementation of the direct algorithm, 
the tree algorithm requires accessing memory locations that 
are not aligned with internal memory boundaries. As such, 
we took the same approach as before and mapped the in- 
put data to the CPU's texture memory before launching 
the device kernel. As noted in Section |2.3| this procedure 
is unnecessary on Fermi-generation GPUs, as their built-in 
caches provide the same behaviour automatically. 

After successfully implementing the tree algorithm on a 
GPU using a simple one-step-per-GPU-call approach, we in- 
vestigated the possibility of computing multiple steps of the 
algorithm on the GPU before returning to the CPU for syn- 
chronisation. This is possible because current GPUs, while 
lacking support for global thread synchronisation, do sup- 
port synchronisation across local thread groups (or blocks). 
These thread blocks typically contain 0(100) threads, and 
provide mechanisms for synchronisation and data-sharing, 
both of which are required for a more efficient tree dedisper- 
sion implementation. 

As discussed in Section |3.2[ application of the tree al- 
gorithm to a block of B v channels x B t time samples re- 
quires caching additional values from the next block in time. 
We used blocks of B v x B t = 16 x 16 threads, each load- 
ing both their corresponding data value and required addi- 
tional values into shared cache. Once all values have been 
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stored, computation of the log 2 B u = 4 steps proceeds en- 
tirely within the shared cache. Using larger thread blocks 
would allow more steps to be completed within the cache; 
however, the choice is constrained by the available volume 
of shared memory (typically around 48kB). Once the block 
computation is complete, subsequent steps must be com- 
puted using the one-step-per-GPU-call approach described 
earlier, due to the requirement of global synchronisations. 

While theory suggests that an implementation of the 
tree algorithm exploiting shared memory to perform mul- 
tiple steps in cache would provide a performance benefit 
over a simpler implementation, in practice we were unable 
to achieve a net gain using this approach. The limitations 
on block size imposed by the volume of shared memory, the 
need to load additional data into cache and the logarith- 
mic scaling of steps relative to data size significantly reduce 
the potential speed-up, and overheads from increased code- 
complexity quickly erode what remains. For this reason we 
reverted to the straight-forward implementation of the tree 
algorithm as our final code for testing and benchmarking. 

In addition to the base tree algorithm, we also imple- 
mented the sub-band method so as to allow the computation 
of arbitrary dispersion measures. This was achieved by di- 
viding the computation into two stages. In stage 1, the first 
log 2 N' u steps of the tree algorithm are applied to the input 
data, which produces the desired N v /N' u tree-dedispersed 
sub-bands. Stage 2 then involves applying an algorithm to 
combine the dedispersed time series in different sub-bands 
into approximated quadratic dispersion curves according to 
equation ( 22 1 . Stage 2 was implemented on the GPU in 



much the same way as the direct algorithm, with input data 
mapped to texture memory (on pre-Fermi GPUs) and delays 
stored in look-up tables in constant device memory. 

The frequency padding approach described in Sec- 
tion [3TL2] was implemented by constructing an array large 
enough to hold the stretched frequency coordinates, initial- 
ising its elements to zero, and then copying (or scattering) 
the input data into this array according to equation (251. 



The results of this procedure were then fed to the basic tree 
dedispersion code to produce the final set of dedispersed 
time series. 

Because the tree algorithm involves sequentially updat- 
ing the entire data-set, the data must remain in their final 
format for the duration of the computation. This means that 
low bit-rate data, e.g., 2-bit, must be unpacked (in a pre- 
processing step) into a format that will not overflow during 
accumulation. This is in contrast to the direct algorithm, 
where each sum is independent, and can be stored locally to 
each thread. 



4 SUB-BAND DEDISPERSION 
4.1 Introduction 

Sub-band dedispersion is the name given to another tech- 
nique used to compute the dedispersion transform. Like the 
tree algorithm described in Section |3j the sub-band algo- 
rithm attempts to reduce the cost of the computation rela- 
tive to the direct method; however, rather than exploiting 
a regularisation of the dedispersion algorithm, the sub-band 
method takes a simple approximation approach. 



In its simplest form, the algorithm involves two process- 
ing steps. In the first, the set of trial DMs is approximated 
by a reduced set of -/VoMnom = Nbm/N^ m 'nominal' DMs, 
each separated by N^ M trial dispersion measures. The direct 
dedispersion algorithm is applied to sub-bands of N' v chan- 
nels to compute a dedispersed time series for each nominal 
DM and sub-band. In the second step, the DM trials near 
each nominal value are computed by applying the direct al- 
gorithm to the 'miniature filterbanks' formed by the time 
series for the sub-bands at each nominal DM. These data 
have a reduced frequency resolution of JVsb = N v /N' v chan- 
nels across the band. The two steps thus operate at reduced 
dispersion measure and frequency resolution respectively, re- 
sulting in an overall reduction in the computational cost. 

The sub-band algorithm is implemented in the presto 
software suite l| Ransom 2001 1 and was recently implemented 



on a GPU by |Magro et al. (2011 \ (see Section 6.1 for a com- 
parison with their work). Unlike the tree algorithm, the sub- 
band method is able to compute the dedispersion transform 
with the same flexibility as the direct method, making its 
application to real observational data significantly simpler. 

The approximations made by the sub-band algorithm 
introduce additional smearing into the dedispersed time se- 
ries. We derive an analytic upper-bound in Appendix [B] and 
show that, to first order, the smearing time isB is propor- 
tional to the product N^ M Ni [see equation |B2|)]. 



4.2 Algorithm analysis 

The computational complexity of the sub-band dedispersion 
algorithm can be computed by summing that of the two 
steps: 



2SB,1 = NsB • Tdlroct(JVt) N' v , -/VoMnom) 
TsB,2 = A?DMnom ■ Tdircct ( Nt , iVsB , N^m) 
TsB = TsB.l + 2SB,2 



= O 



N t NnMN v 



AS 



1 1 



(27) 
(28) 
(29) 

(30) 



This result can be combined with knowledge of the smear- 
ing introduced by the algorithm to probe the relationship 
between accuracy and performance. Inserting the smearing 
constraint tss oc N^ M N^ (see Section 4.1l into equation 
(30 1, we obtain a second-order expression that is minimised 
at N^ M = N' v oc y/tss, which amounts to balancing the ex- 
ecution time between the two steps. This result optimises 
the time complexity of the algorithm, which then takes the 
simple form 



TL 



o 



f N OM N„ 



(31) 



and represents a theoretical speed-up over the direct al- 
gorithm proportional to the square root of the introduced 
smearing. 

The sub-band algorithm's dependence on the direct 
algorithm means that it inherits similar algorithmic be- 
haviour. However, as with the tree method, the decrease 
in computational work afforded by the sub-band approach 
corresponds to a decrease in the arithmetic intensity of the 
algorithm. This can be expected to reduce the intrinsic per- 
formance of the two sub-band steps relative to the direct 
algorithm. 
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One further consideration for the sub-band algorithm 
is the additional volume of memory required to store the 
intermediate results produced by the first step. These data 
consist of time series for each sub-band and nominal DM, 
giving a space complexity of 



M SB = O (NqbNom nora } 



(32) 



Assuming the time complexity is optimised as in equation 
(311, the space complexity becomes 



Mi B =0 



tssj 



(33) 



which indicates that the memory consumption increases 
much faster than the execution time, in direct proportion 
to the introduced smearing rather than to its square root. 
This can be expected to place a lower limit on the smearing 
that can be achieved in practice. 



equations for the smearing during the direct dedispersion 
proces^] assuming an intrinsic pulse width of 40/is. 

For the piecewise linear tree algorithm, the effective sig- 
nal smearing at low dispersion measure is dominated by the 
intrinsic pulse width, the sampling time At and the effect 
of finite DM sampling. As the DM is increased, however, 
the effects of finite channel width and the sub-band tech- 
nique grow, and eventually become dominant. These smear- 
ing terms both scale linearly with the dispersion measure, 
and so the relative contribution of the sub-band method, 
USB, tends to a constant. 

The sub-band algorithm exhibits virtually constant 
smearing as a function of DM due to its dependence on 
the DM step, which is itself chosen to maintain a fixed frac- 
tional smearing. While the general trend mirrors that of the 
tree algorithm, the sub-band algorithm's smearing is typi- 
cally around two orders of magnitude worse than its tree 
counterpart. 



4.3 Implementation notes 

A significant advantage of the sub-band algorithm over the 
tree algorithm is that it involves little more than repeated 
execution of the direct algorithm. With sufficient general- 
isatiorj^] of our implementation of the direct algorithm, we 
were able to implement the sub-band method with just two 
consecutive calls to the direct dedispersion routine and the 
addition of a temporary data buffer. 

In our implementation, the 'intermediate' data (i.e., the 
outputs of the first step) are stored in the temporary buffer 
using 32 bits per sample. The second call to the dedispersion 
routine then reads these values directly before writing the 
final output using a desired number of bits per sample. 

Experimentation showed that optimal performance oc- 
curred at a slightly different shape and size of the thread 
blocks on the GPU compared to the direct algorithm (see 
Section [2.2[ ). The sub-band kernels operated most efficiently 
with 128 threads per block divided into 16 time samples and 
8 DMs. In addition, the optimal choice of the ratio Nl/N^ M 
was found to be close to unity, which matches the theoretical 
result derived in Section |4.2| While these parameters min- 
imised the execution time, the sub-band kernels were still 
found to perform around 40% less efficiently than the direct 
kernel. This result is likely due to the reduced arithmetic 



intensity of the algorithm (see Section 4.2 \ 



5 RESULTS 
5.1 Smearing 

Our analytic upper-bounds on the increase in smearing due 



to use of the piecewise linear tree algorithm [equation (A6)] 
and the sub-band algorithm [equation | |B3[ )] are plotted in 
the upper panels of Fig s. [3|a nd[4| respectively. The reference 
point [W in equations |A6l and (|B3[|] was calculated using 



5.2 Performance 

Our codes as implemented allowed us to directly compute 
the following: 

• Any list of DMs using the direct or sub-band algorithm 
with no time-scrunching, 

• DMs up to the diagonal [see equation ( |21[ )] using the 
piecewise linear tree algorithm, and 



• DMs up to the diagonal [see equation ( 26 1] using the 
frequency-padded tree algorithm. 

A number of timing benchmarks were run to compare 
the performance of the CPU to the GPU and the direct 
algorithm to the tree algorithms. Input and dispersion pa- 
rameters were chosen to reflect a typical scenario as appears 
in modern pulsar surveys such as the High Time Resolution 
Universe (HTRU) survey currently underway at the Parkes 
radio telescope ( Keith et al. 2010 1. The benchmarks involved 



computing the dedispersion transform of one minute of input 
data with observing parameters of bits/sample = 2, Va = 
1581.8MHz, Au = -0.39062MHz, N v = 1024, At = 64/^s. 
DM trials were chosen to match those used in the HTRU 
survey, which were originally derived by applying an ana- 
lytic constraint on the signal-smearing due to incorrect trial 
D^Q The chosen set contained 1196 trial DMs in the range 
^ DM < 1000 pc cm -3 with approximately exponential 
spacing. 

For comparison purposes, we benchmarked a reference 
CPU direct dedispersion code in addition to our GPU codes. 
The CPU code (named dedisperse_all) is highly opti- 
mised, and uses multiple CPU cores to compute the dedis- 
persion transform (parallelised over the time dimension) in 
addition to bit-level parallelism as described in Section [2. 3| 
dedisperse_ALL is approximately 60 x more efficient than 
the generic dedisperse routine from SlGPROcQ but is only 
applicable to a limited subset of data formats. 



5 The direct dedispersion routine was modified to support 'batch- 
ing' (simultaneous application to several adjacent data-sets) and 
arbitrary strides through the input and output arrays, trial DMs 
and channels. 



6 Levin, L. 2011, priv. comm. 

Levin, L. 2011, priv. comm.; see |Cordes fc McLaughlin|2003| for 
a similar derivation. 
8 sigproc.sourceforge.net 
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Figure 3. Upper: Analytic upper-bound on signal degradation of a 40/is pulse due to the piecewise linear tree algorithm as a function of 
the number of channels per sub-band [see equation ( |A6| ] . Lower: Performance results for the direct and piecewise linear tree algorithms 
with (a) and without (b) 'time-scrunching' applied. Benchmarks were executed on an Intel Core i7 930 quad-core CPU and NVIDIA 
Tesla C1060, Tesla C2050 and GeForce GTX 480 GPUs. All results correspond to operations on one minute of input data with the 
following observing parameters: bits/sample = 2, u c = 1381.8MHz, BW = 400MHz, N v = 1024, At = 64/xs. A total of 1196 DM trials 
were used, spaced non-linearly in the range ^ DM < 1000 pc cm -3 (see text for details). Error bars are too small to be seen at this 
scale and are not plotted. Note that performance results are projected from measurements of codes performing sub-sets of the benchmark 
task (see text for details). 



At the time of writing, our dedispersion code-base did 
not include 'full-capability' implementations of all of the 
discussed algorithms. However, we were able to perform a 
number of benchmarks that were sufficient to obtain accu- 
rate estimates of the performance of complete runs. Tim- 
ing measurements for our codes were projected to produce 
a number of derived results representative of the complete 
benchmark task. The direct/sub-band dedispersion code was 
able to compute the complete list of desired DMs, but was 
not able to exploit time-scrunching; results for these algo- 
rithms with time scrunching were calculated by assuming 
that the computation of DMs between 2x and 4x the di- 
agonal would proceed twice as fast as the computation up 
to 2 x the diagonal (as a result of there being half as many 
time samples), and similarly for 4x to 8x etc. up to the 
maximum desired DM. A simple code to perform the time- 
scrunching operation (i.e., adding adjacent time samples to 
reduce the time resolution by a factor of two) was also bench- 
marked and factored into the projection. For the tree codes, 
which were unable to compute DMs beyond the diagonal, 
timing results were projected by scaling as appropriate for 
the computation of the full set of desired DMs with or with- 



out time-scrunching. Individual sections of code were timed 
separately to allow for different scaling behaviours. 

Benchmarks were run on a variety of hardware configu- 
rations. CPU benchmarks were run on an Intel i7 930 quad- 
core CPU (Hyperthreading enabled). GPU benchmarks were 
run using version 3.2 of the CUDA toolkit on the pre-Fermi 
generation NVIDIA Tesla C1060 and the Fermi generation 
NVIDIA Tesla C2050 (error-correcting memory disabled) 
and GeForce GTX 480 GPUs. Hardware specifications of 
the GPUs' host machines varied, but were not considered to 
significantly impact performance measurements other than 
the copies between host and GPU memory. Benchmarks for 
these copy operations were averaged across the different ma- 
chines. 

Our derived performance results for the direct and 
piecewise linear tree codes are plotted in the lower panels 
of Fig. [3] The performance of the frequency-padded tree 
code corresponded to almost exactly half that of the piece- 
wise linear tree code at a sub-band size of N' v = 1024; these 
results were omitted from the plot for clarity. 

Performance results for the sub-band dedispersion code 
are plotted in the lower panels of Fig.[4]along with the results 
of the direct code for comparison. Due to limits on memory 
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Figure 4. Upper: Analytic upper-bound on signal degradation of a 40/iS pulse due to the sub-band algorithm as a function of the 
number of channels per sub-band [see equation JB3J|] . Lower: Performance results for the direct and sub-band methods with (a) and 
without (b) the use of time-scrunching. See Fig. [3] caption for details. Note that it was not possible to run benchmarks of the sub-band 
code for N' v < 16 due to memory constraints. 



use (see Section 4.2 1, benchmarks for N' v < 16 were not 
possible. 

Performance was measured by inserting calls to the 
Unix function gettimeofday() before and after relevant sec- 
tions of code. Calls to the CUD A function cudaThreadSyn- 
chronize() were inserted where necessary to ensure that 
asynchronous GPU functions had completed their execution 
prior to recording the time. 

Several different sections of code were timed indepen- 
dently. These included pre- and post-processing steps (e.g., 
unpacking, transposing, scaling) and copies between host 
and GPU memory (in both directions), as well as the dedis- 
persion kernels themselves. Disk I/O and portions of code 
whose execution time does not scale with the size of the in- 
put were not timed (see Section [6] for a discussion of the 
impact of disk I/O). Timing results represent the total ex- 
ecution time of all timed sections, including memory copies 
between the host and the device in the case of the GPU 
codes. 

Each benchmark was run 101 times, from which the 
the median execution time was chosen as the final measure- 
ment. Recorded uncertainties corresponded to the 5 th and 
95 th percentiles; the error bars are too small to be seen in 
Figs. [3] and [4] and were not plotted. 



6 DISCUSSION 



The lower panel of Fig. 3(a) shows a number of interesting 



performance trends. As expected, the slowest computation 
speeds come from the direct dedispersion code running on 
the CPU. Here, some scaling is achieved via the use of mul- 
tiple cores, but the speed-up is limited to around 2.5 x when 
using all four. This is likely due to saturation of the available 
memory bandwidth. 

Looking at the corresponding results on a GPU, a large 
performance advantage is clear. The GTX 480 achieves a 
9x speed-up over the quad-core CPU, and even the last- 
generation Tesla C1060 manages a factor of 5x. The fact 
that a single GPU is able to compute the dedispersion trans- 
form in less than a third of the real observation time makes 
it an attractive option for real-time detection pipelines. 

A further performance boost is seen in the transition 
to the tree algorithm. Computation speed is projected to 
exceed that of the direct code for almost all choices of N' v , 
peaking at around 3x at N' v — 64. Performance is seen to 
scale approximately linearly for N' v < 32, before peaking and 
then decreasing very slowly for N' u > 64. This behaviour is 
explained by the relative contributions of the two stages of 
the computation. For small N' v , the second, 'sub-band com- 
bination', stage dominates the total execution time [scaling 
as 0{1/N' V )]. At large N' u the execution time of the second 
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Table 1. Summary of host<->GPU memory copy times 

Code Copy Time Fraction of total time 

Direct 0.62 s < 5% 

Tree 1.05 s < 30% 

Sub-band 0.62 s 10% - 65% 

stage becomes small relative to the first, and scaling follows 
that of the basic tree algorithm [i.e., O(logiV^)]. 

The results of the sub-band algorithm in Fig. |4(a)| also 
show a significant performance advantage over the direct al- 
gorithm. The computable benchmarks start at N' V =\Q with 
around the same performance as the tree code. From there, 
performance rapidly increases as the size of the sub-bands 
is increased, eventually tailing off around A^=256 with a 
speed-up of approximately 20 x over the direct code. At such 
high speeds, the time spent in the GPU kernel is less than 
the time spent transferring the data into and out of the 
GPU. The significance of this effect for each of the three 
algorithms is given in Table [l] 

The results discussed so far have assumed the use of 
the time-scrunching technique during the dedispersion com- 
putation. If time-scrunching is not used, the projected per- 
formance results change significantly [see lower panels Figs. 
[3(b)1 and [4(b)] . Without the use of time-scrunching, the di- 
rect dedispersion codes perform around 1.6x slower, and 
similar results are seen for the sub-band code. The tree 
codes, however, are much more severely affected, and per- 
form 5x slower when time-scrunching is not employed. This 
striking result can be explained by the inflexibilities of the 
tree algorithm discussed in Section [3.1| At large dispersion 
measure, the direct algorithm allows one to sample DM 
space very thinly. The tree algorithms, however, do not - 
they will always compute DM trials at a fixed spacing [see 
equation j9J], This means that the tree algorithms are effec- 
tively over-computing the problem, which leads to the ero- 
sion of their original theoretical performance advantage. The 
use of time-scrunching emulates the thin DM-space sampling 
of the direct code, and allows the tree codes to maintain an 
advantage. 

While the piecewise linear tree code and the sub-band 
code are seen to provide significant speed-ups over the direct 
code, their performance leads come at the cost of introduc- 
ing additional smearing into the dedispersed signal. Our an- 
alytic results for the magnitude of the smearing due to the 
tree code (upper panels Fig.[3| show that for the chosen ob- 
serving parameters, the total smear is expected to increase 
by less than 10% for all N' v sC 64 at a DM of 1000 pc cm -3 . 
Given that peak performance of the tree code also corre- 
sponded to N' v — 64, we conclude that this is the optimal 
choice of sub-band size for such observations. 

The smearing introduced by the sub-band code (upper 
panels Fig. |4j is significantly worse, increasing the signal 
degradation by three orders of magnitude more than the 
tree code. Here, the total smear is expected to increase by 
around 40% at N^=16, and at Nl=32 the increase in smear- 
ing reaches 300%. While these results are upper limits, it is 
unlikely that sub-band sizes of more than will pro- 

duce acceptable results in practical scenarios. 

In contrast to the piecewise linear code, the frequency- 
padded tree code showed only a modest speed-up of around 
1.5x over the direct approach due to its doubling of the 



number of frequency channels. Given that the sub-band al- 
gorithm has a minimal impact on signal quality, we conclude 
that the frequency-padding technique is an inferior option. 

It is also important to consider the development cost 
of the algorithms we have discussed. While the tree code 
has shown both high performance and accuracy, it is also 
considerably more complex than the other algorithms. The 
tree algorithm in its base form, as discussed in Section [3. 1| 
is much less intuitive than the direct algorithm (e.g., the 
memory access patterns in Fig.[2|. This fact alone makes im- 
plementation more difficult. The situation gets significantly 
worse when one must adapt the tree algorithm to work in 
practical scenarios, with quadratic dispersion curves and ar- 
bitrary DM trials. Here, the algorithm's inflexibility makes 
implementation a daunting task. We note that our own im- 
plementations are as yet incomplete. By comparison, imple- 
mentation of the direct code is relatively straightforward, 
and the sub-band code requires only minimal changes. De- 
velopment time must play a role in any decision to use one 
dedispersion algorithm over another. 

The three algorithms we have discussed each show rela- 
tive strengths and weaknesses. The direct algorithm makes 
for a relatively straightforward move to the GPU archi- 
tecture with no concerns regarding accuracy, and offers a 
speed-up of up to 10 x over an efficient CPU code. How- 
ever, its performance is convincingly beaten by the tree and 
sub-band methods. The tree method is able to provide sig- 
nificantly better performance with only a minimal loss of 
signal quality; however, it comes with a high cost of devel- 
opment that may outweigh its advantages. Finally, the sub- 
band method combines excellent performance with an easy 
implementation, but is let down by the substantial smear- 
ing it introduces into the dedispersed signal. The optimal 
choice of algorithm will therefore depend on which factors 
are most important to a particular project. While there is 
no clear best choice among the three different algorithms, 
we emphasize that between the two hardware architectures 
the GPU clearly outperforms the CPU. 

When comparing the use of a GPU to a CPU, it is in- 
teresting to note that our final GPU implementation of the 
direct dedispersion algorithm on a Fermi-class device is, rel- 
atively speaking, a simple code. While it was necessary in 
both the pre-Fermi GPU and multi-core CPU implementa- 
tions to use non-trivial optimisation techniques (e.g., texture 
memory, bit-packing etc.), the optimal implementation on 
current-generation, Fermi, GPU hardware was also the sim- 
plest or 'obvious' implementation. This demonstrates how 
far the (now rather misnamed) graphics processing unit has 
come in its ability to act as a general-purpose processor. 

In addition to the performance advantage offered by 
GPUs today, we expect our implementations of the dedis- 
persion problem to scale well to future architectures with 
little to no code modification. The introduction of the cur- 
rent generation of GPU hardware brought with it both a sig- 
nificant performance increase and an equally significant re- 
duction in programming complexity. We expect these trends 
to continue when the next generation of GPUs is released, 
and see a promising future for these architectures and the 
applications that make use of them. 

While we have only discussed single-GPU implementa- 
tions of dedispersion, it would in theory be a simple mat- 
ter to make use of multiple GPUs, e.g., via time-division 
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multiplexing of the input data or allocation of a sub-set of 
beams to each GPU. As long as the total execution time 
is dominated by the GPU dedispersion kernel, the effects 
of multiple GPUs within a machine sharing resources such 
as CPU cycles and PCI-Express bandwidth are expected to 
be negligible. However, as shown in Table [I] the tree and 
sub-band codes are in some circumstances so efficient that 
host-f-^device memory copy times become a significant frac- 
tion of the total run time. In these situations, the use of 
multiple GPUs within a single host machine may influence 
the overall performance due to reduced PCI-Express band- 
width. 

Disk I/O is another factor that can contribute to the to- 
tal execution time of a dedispersion process. Typical server- 
class machines have disk read/ write speeds of only around 
100 MB/s, while our GPU dedispersion codes are capable 
of producing 8-bit time series at well over twice this rate. If 
dedispersion is performed in an offline fashion, where time 
series are read from and written to disk before and after 
dedispersion, then it is likely that disk performance will be- 
come the bottle-neck. The use of multiple GPUs within a 
machine may exacerbate this effect. However, for real-time 
processing pipelines where data are kept in memory between 
operations, the dedispersion kernel can be expected to dom- 
inate the execution time. This is particularly important for 
transient search pipelines, where acceleration searching is 
not necessary and dedispersion is typically the most time- 
consuming operation. 

The potential impact of limited PCI-Express bandwidth 
or disk I/O performance highlights the need to remember 
Amdahl's Law when considering further speed-ups in the 
dedispersion codes: the achievable speed-up is limited by 
the largest bottle-neck. The tree and sub-band codes are al- 
ready on the verge of being dominated by the hostodevice 
memory copies, meaning that further optimisation of their 
kernels will provide diminishing returns. While disk and 
memory bandwidths will no-doubt continue to increase in 
the future, we expect the ratio of arithmetic performance to 
memory performance to get worse rather than better. 

The application of GPUs to the problem of dedisper- 
sion has produced speed-ups of an order of magnitude. The 
implications of this result for current and future surveys are 
significant. Current projects often execute pulsar and tran- 
sient search pipelines offline due to limited computational 
resources. This results in event detections being made long 
after the time of the events themselves, limiting analysis and 
confirmation power to what can be gleamed from archived 
data alone. A real-time detection pipeline, made possible by 
a GPU-powered dedispersion code, could instead trigger sys- 
tems to record invaluable baseband data during significant 
events, or alert other observatories to perform follow-up ob- 
servations over a range of wavelengths. Real-time detection 
capabilities will also be crucial for next-generation telescopes 
such as the Square Kilometre Array pathfinder programs 
ASKAP and MeerKAT. The use of GPUs promises signifi- 
cant reductions in the set-up and running costs of real-time 
pulsar and transient processing pipelines, and could be the 
enabling factor in the construction of ever-larger systems in 
the future. 



Table 2. Timing comparisons for direct GPU dedispersion of the 
'toy observation' denned in |Magro et al.| | |2011| l (v c =610 MHz, 
BW=20 MHz, JV„=256, At=12.8|*s, ^1^=500, (KDM<60 pc 
cm -3 ). All benchmarks were executed on a Tesla C1060 GPU. 



Stage 


Magro et al. 


( 


2011 


) 


This work 


Ratio 


Corner turn 
De-dispersion 

GPU->CPU copy 


112 ms 
4500 ms 

220 ms 


7 ms 
1959 ms 

144 ms 


16x 
2.29X 

1.52x 


Total 


4832 ms 


2110 ms 


2.29x 



6.1 Comparison with other work 



Magro et al. (20111 recently reported on a GPU code that 
could achieve very high (> 100 x) speed-ups over the dedis- 
persion routines in SIGPROC and presto (Ransom||2001 1 



whereas our work only finds improvements of factors of 10- 
30 over dedisperse_all. There are two key reasons for the 
apparent discrepancy in speed. Firstly, the SIGPROC routine 
was never written to optimise performance but rather to 
produce reliable dedispersed data streams from a very large 
number of different backends. Inspection of the innermost 
loop reveals a conditional test that prohibits parallelisation, 
and a two dimensional array that is computationally expen- 
sive. Secondly, SIGPROC only produces one DM per file read, 
which is very inefficient. We believe that these factors ex- 
plain the very large speed-ups reported by Magro et al. In 
our own benchmarks, we have found our CPU comparison 
code dedisperse_ALL to be ~ 60 x faster than SIGPROC. For 
comparison, this puts our direct GPU code at ~ 300 x faster 
than SIGPROC when using the same Tesla C1060 model GPU 
as Magro et al. 

Direct comparison of our GPU results with those of 
Magro et al. is difficult, as the details of the CPU code, 
the method of counting FLOP/s and the observing param- 
eters used in their performance plots is not clear. However, 
we have benchmarked our GPU code on the 'toy observa- 
tion' presented in section 5 of their paper. The execution 
times are compared in Table [2] Magro et al. did not specify 
the number of bits per sample used in their benchmark; we 
chose to use 8 bits/sample, but found no significant differ- 
ence when using 32 bits/sample. We found our implementa- 
tion of the direct dedispersion algorithm to be ~ 2.3 x faster 
than that reported in their work. Possible factors contribut- 
ing to this difference include our use of texture memory, two- 
dimensional thread blocks and allocation of multiple samples 
per thread. The performance results of our implementation 
of the sub-band dedispersion algorithm generally agree with 
those of Magro et al., although the impact of the additional 
smearing is not quantified in their work. 

In summary, we agree with Magro et al. that GPUs 
offer great promise in incoherent dedispersion. The benefit 
over that of CPUs is, however, closer to the ratio of their 
memory bandwidths (~ 10 x) than the factor of 100 re- 
ported in their paper, which relied on comparison with a 
non-optimised single-threaded CPU code. 



6.2 Code availability 

We have packaged our GPU implementation of the direct 
incoherent dedispersion algorithm into a C library that we 
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make available to the communitjj^] The application pro- 
gramming interface (API) was modeled on that of the 
FFTW librar£3 which was found to be a convenient fit. 
The library requires the NVIDIA CUDA Toolkit, but places 
no requirements on the host application, allowing easy inte- 
gration into existing C/C+- |-/Fortran etc. codes. While the 
library currently uses the direct dedispersion algorithm, we 
may consider adding support for a tree or sub-band algo- 
rithm in future. 



7 CONCLUSIONS 

We have analysed the direct, tree and sub-band dedisper- 
sion algorithms and found all three to be good matches for 
massively-parallel computing architectures such as GPUs. 
Implementations of the three algorithms were written for the 
current and previous generations of GPU hardware, with the 
more recent devices providing benefits in terms of both per- 
formance and ease of development. Timing results showed 
a 9x speed-up over a multi-core CPU when executing the 
direct dedispersion algorithm on a GPU. Using the tree al- 
gorithm with a piecewise linear approximation technique re- 
sults in some additional smearing of the input signal, but was 
projected to provide a further 3x speed-up at a very modest 
level of signal-loss. The sub-band method provides a means 
of obtaining even greater speed-ups, but imposes significant 
additional smearing on the dedispersed signal. These results 
have significant implications for current and future radio 
pulsar and transient surveys, and promise to dramatically 
lower the cost barrier to the deployment of real-time detec- 
tion pipelines. 



the dedispersion problem, the second derivative of the delay 
function with respect to frequency is given by 



^At(d, v) = DM(d)^AT(v) 



dv 



dv 2 

A 2 

= 6DM(d)fc DM ^r 
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which has greater magnitude at lower frequencies. Evaluat- 
ing at the lowest frequency in the band, v — N v , and substi- 
tuting into equation ( Al I along with the sub-band size N' v , 



one finds the error to be bounded by 
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where A = ^-N u is a proxy for the fractional bandwidth, a 
measure of the width of the antenna band. 

If the smearing as a result of using the direct algorithm 
is quantified as the effective width, W , of an observed pulse, 
then the piecewise linear tree algorithm is expected to pro- 
duce a signal with an effective width of 



w tTCC = yV 2 + t t 2 

giving a relative smearing of 
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fJ-tree 



(A5) 



(A6) 
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In contrast to the use of a piecewise linear approximation, 
the use of a change of frequency coordinates ('frequency 
padding') to linearise the dispersion trails results in no ad- 
ditional sources of smear. 
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APPENDIX A: ERROR ANALYSIS FOR THE 
TREE DEDISPERSION ALGORITHM 

Here we derive an expression for the maximum error intro- 
duced by the use of the piecewise linear tree dedispersion 
algorithm. 

The deviation of a function f(x) from a linear approxi- 
mation between x = xq and x = x 1 is bounded by 



- xo 



max 



df_ 

dx 2 



(Al) 



which shows that the error is proportional to the square of 
the step size and the second derivative of the function. For 



Our library and its source code are available at: |http://| 
dedisp . googlecode . com/ 
10 |http://www.fftw.org[ 



APPENDIX B: ERROR ANALYSIS FOR THE 
SUB-BAND DEDISPERSION ALGORITHM 

Here we derive an expression for the maximum error intro- 
duced by the use of the sub-band dedispersion algorithm. 

The smearing introduced into a dedispersed time series 
due to an approximation to the dispersion curve is bounded 
by the maximum temporal deviation of the approximation 
from the exact curve. The maximum change in delay across 
a sub-band is At(DM, N„) - At(DM, N v - K); the differ- 
ence in this value between two nominal DMs then gives the 
smearing time: 



t S B ^ ADM 

= iVoMADM 



AT(N, 



- AT(N V - K 
J iV, (1 + A) 3 T 



(Bl) 



O 
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where the second form is obtained through Taylor expansion 
in powers of ^f- around zero. Note that this derivation as- 
sumes the dispersion curve is approximated by aligning the 
'early' edge of each sub-band. An alternative approach is to 
centre the sub-bands on the curve, which reduces the smear- 
ing by ~ 2x but adds complexity to the implementation. 

As with the tree algorithm, we can define the relative 
smearing of the sub-band algorithm with respect to the di- 
rect algorithm as 



MSB 



W 



yw 2 +ti B 

w 



(B3) 



Accelerating incoherent dedispersion 

where, as before, W is the effective width of an observed 
pulse after direct dedispersion. 
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